FIG. 2 is a flow diagram illustrating one simplified exemplary
method embodiment of the present invention.

FIG. 3 is a flow diagram illustrating one exemplary method
embodiment of a fault management exercise of the present invention.

FIGS. 4A and 4B illustrate an embodiment of a computer system
suitable for implementing embodiments of the present invention.

It is to be understood that, in the drawings, like reference
numerals designate like structural elements. Also, it is understood
that the depictions in the Figures are not necessarily to scale.

DETAILED DESCRIPTION OF THE DRAWINGS

The present invention has been particularly shown and described
with respect to embodiments and specific features thereof. The
embodiments set forth herein below are to be taken as illustrative
rather than limiting. It should be readily apparent to those of
ordinary skill in the art that various changes and modifications in
form and detail may be made without departing from the spirit and
scope of the invention.

When a computer system encounters a system interrupt (an error) the
computer system can begin to function erratically or fail
completely. A computer system error is a symptom detected by the
computer system in response to a fault (i.e., the underlying problem
in the system that caused the error). Typical examples of such
errors include commands that time out, bus errors, I/O errors, ECC
memory (Error-Correcting Code memory) errors, unexpected software
results, and the like. Other errors include the typical 256 software
interrupts that are commonly found on interrupt vector tables. Such
software interrupts are commonly referred to as traps or exceptions.
Other error examples include hardware interrupts (e.g., IRQ line
failures etc.). The faults that cause such errors are legion. A few
common examples include device failures, bus line failures,
disconnected cables, memory failures, and many, many more. It is
important that faults causing these errors be identified and
corrected as soon as possible to enable efficient system operation.

The embodiments of the present invention go beyond current
approaches to fault diagnosis and correction and do not require
extensive manual action on the part of the system administrator. The
embodiments go beyond approaches that are limited to general error
reporting and rudimentary guidance as to which diagnostic tools may
be useful in finding the responsible fault. The embodiments of the
invention do not always require the system administrator to evaluate
errors to determine which diagnostic tools to use next and then
acquire further error information in order to diagnose the nature of
the fault. The embodiments of the invention can programmatically
take action to correct faults. The systems and method embodiments of
the invention can operate with the system "on-line". This goes
beyond existing approaches that have no ability to capture data,
diagnose faults, and correct faults "on the fly" (while the system
is online operating normally). Additionally, embodiments of the
present invention are readily extensible. Thus, when new diagnostic
tools become available, they can simply be plugged into the system
and used. There is no need for the system to be taken offline and no
need for the entire messaging sub-scheme to be reconfigured in order
to patch in the new tool as is the case with conventional
approaches.

The following detailed description describes various method and
apparatus embodiments of a fault management architecture used in a
computer system. In general, a fault management architecture
constructed in accordance with the principles of the invention
operates in an automated manner to collect error information,
evaluate the error information, and diagnose faults associated with
the error information. Additionally, the fault management
architecture takes action to resolve the faults. Such resolution can
also be automated. Embodiments of the fault management architecture
operate at the user level of the operating system (O/S) and not at
the kernel level and so do not require that the system be taken
offline in order to operate. In fact, the fault management
architecture of the present invention can be continuously operating
whenever the system is operating. Moreover, the fault management
architecture can be readily updated with improved features without
taking the system offline. For example, new or updated diagnostic
engines 102 and fault correction agents 103 can be added to (or
removed from) the system while the computer system is operating
without interfering with the normal operation of the computer
system.

For purposes of this disclosure there is a user level and a kernel
level. System and end-user application software runs at the
"user-level". Additionally, there is a kernel level. As is known to
those having ordinary skill in the art, the kernel is a special
program that manages system resources (e.g., software and hardware).
The kernel insulates applications from system hardware while
providing them with controlled access to hardware and essential
system services including, but not limited to I/O management,
virtual memory, and scheduling.

FIG. 1 depicts one example of a suitable fault management
architecture constructed in accordance with the principles of the
invention. In the depicted embodiment, the fault management
architecture 100 operates in a computer system at the user level.
The advantage of operating at the user level means that the
operation of the fault management architecture does not interfere
with the operation of the kernel. Thus, the computer system can
operate effectively at the same time the fault management
architecture is operating. The fault management architecture
includes a fault manager 101, which includes a plurality of
diagnostic engines 102 (e.g., DE1, DE2, . . . DEn) and a plurality
of fault correction agents 103 (e.g., A1, A2, . . . Am). The fault
manager 101 can optionally include a soft error rate discriminator
(SERD) 105 whose function and utility will be explained in greater
detail hereinbelow. The fault management architecture 100 also
includes a data capture engine 110. In some embodiments, the data
capture engine 110 can optionally be included as part of the fault
manager 101 itself. Another advantage of operating the fault
management architecture at the user level is that the diagnostic
engines 102 and the fault correction agents 103 can be plugged into
(or unplugged from) the computer system without interfering with
system operation. The process of capturing data through fault
diagnosis and resolution is referred to as a fault management
exercise. Processes and methods for facilitating such fault
management exercises are described in greater detail elsewhere
herein.

Referring again to FIG. 1, the data capture engine 110 is a set of
computer readable program instructions for receiving and processing
error information from the computer system. For example, the data
capture engine 110 can capture error information in many different
software components (and resources) including, but not limited to, a
kernel module, device drivers, trap handlers, interrupt handlers,
and user-level applications. The data capture engine 110 passes this
error information to the fault manager 101 for further processing.
The data capture engine 110 operates

least one of: an analysis of at least one of computer resource
failure history, system management policy, and relative 
probability of occurrence for each fault possibility. 

15. The fault management architecture of claim 1 wherein 
the fault manager stores provided error reports in a log
comprising an error report log and wherein the error report 
log tracks the status of the provided error reports. 

16. The fault management architecture of claim 1 wherein 
the fault manager includes a soft error rate discriminator 
that:

receives error information concerning correctable errors; 

wherein the soft error rate discriminator is configured so 
that when the number and frequency of correctable 
errors exceeds a predetermined threshold number of 
correctable errors over a predetermined threshold
amount of time, these errors are deemed recurrent 
correctable errors that are sent to the diagnostic engines 
for further analysis; 

wherein the diagnostic engine receives a recurrent
correctable error message and diagnoses a set of fault
possibilities associated with the recurrent correctable
error message; and

wherein a fault correction agent receives the set of fault 
possibilities from the diagnostic engines and then 
resolves the diagnosed fault.

17. The fault management architecture of claim 16 
wherein the soft error rate discriminator receives error 
information concerning correctable errors from the diagnos- 
tic engine. 

18. The fault management architecture of claim 16
wherein the diagnostic engine that identifies a set of fault 
possibilities associated with the recurrent correctable error 
message further determines associated probabilities of 
occurrence for the set of fault possibilities associated with 
the recurrent correctable error message.

19. The fault management architecture of claim 18 
wherein the fault correction agent receives the set of fault
possibilities and associated probabilities of occurrence from
the diagnostic engines and the agent then takes appropriate 
action to resolve the set of fault possibilities. 

20. The fault management architecture of claim 1 wherein 
the fault manager includes a soft error rate discriminator 
that: 

receives error information concerning soft errors; 

wherein the soft error rate discriminator is configured so 
that when the number and frequency of soft errors 
exceeds a predetermined threshold number of soft 
errors over a predetermined threshold amount of time, 
these soft errors are deemed recurrent soft errors that 
are sent to the diagnostic engines for further analysis; 

wherein the diagnostic engine receives a recurrent soft 
error message and diagnoses a set of fault possibilities 
associated with the recurrent correctable error message; 
and 

wherein a fault correction agent receives the set of fault 
possibilities from the diagnostic engines and then 
resolves the diagnosed fault. 

21. The fault management architecture of claim 1 further 
including a fault management administrative tool that is 
configured to enable a user to access the logs to determine 
the fault status and error history of resources in the computer 
system. 

22. The fault management architecture of claim 1 further 
including a fault management statistical file that can be 
reviewed to determine the effectiveness of the diagnostic 
engines and fault correction agents at diagnosing faults and 
resolving faults. 

23. The fault management architecture of claim 1 wherein 
the computer system comprises a single computer device. 

24. The fault management architecture of claim 1 wherein 
the computer system comprises a plurality of computers 
forming a network. 
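
By way of illustration only, the thresholding behavior recited for the soft error rate discriminator in claims 16 and 20 might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the class name, the method name, the sliding-window interpretation of "a predetermined threshold number ... over a predetermined threshold amount of time", and the example threshold values are all hypothetical and do not appear in the disclosure.

```python
from collections import deque
from typing import Deque


class SoftErrorRateDiscriminator:
    """Hypothetical sliding-window sketch of a SERD (names are assumptions)."""

    def __init__(self, threshold_count: int, window_seconds: float) -> None:
        self.threshold_count = threshold_count  # assumed "threshold number"
        self.window_seconds = window_seconds    # assumed "threshold amount of time"
        self._timestamps: Deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one correctable (soft) error observed at `timestamp`.

        Returns True when the errors inside the window exceed the threshold,
        i.e. when they would be deemed recurrent and passed on to a
        diagnostic engine for further analysis.
        """
        self._timestamps.append(timestamp)
        # Discard errors that have aged out of the time window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold_count


# Example: more than 3 errors within 60 seconds are deemed recurrent.
serd = SoftErrorRateDiscriminator(threshold_count=3, window_seconds=60.0)
results = [serd.record(t) for t in (0.0, 10.0, 20.0, 30.0, 200.0)]
```

In this sketch the fourth error, arriving within the same 60-second window as the first three, is the first to be flagged as recurrent; the isolated error at 200 seconds is not flagged, because the earlier errors have aged out of the window by then.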



